In this notebook I use machine learning techniques to predict diamond prices.
# Importing the libraries
import pandas as pd
import numpy as np
color_mix=['#FA8072','#DC143C','#8B0000','#FFA500','#FF4500']
# Importing the data and analysing it
diamonds = pd.read_csv('diamonds.csv')
diamonds.head(15)
| | Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 2 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 3 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 4 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 5 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| 5 | 6 | 0.24 | Very Good | J | VVS2 | 62.8 | 57.0 | 336 | 3.94 | 3.96 | 2.48 |
| 6 | 7 | 0.24 | Very Good | I | VVS1 | 62.3 | 57.0 | 336 | 3.95 | 3.98 | 2.47 |
| 7 | 8 | 0.26 | Very Good | H | SI1 | 61.9 | 55.0 | 337 | 4.07 | 4.11 | 2.53 |
| 8 | 9 | 0.22 | Fair | E | VS2 | 65.1 | 61.0 | 337 | 3.87 | 3.78 | 2.49 |
| 9 | 10 | 0.23 | Very Good | H | VS1 | 59.4 | 61.0 | 338 | 4.00 | 4.05 | 2.39 |
| 10 | 11 | 0.30 | Good | J | SI1 | 64.0 | 55.0 | 339 | 4.25 | 4.28 | 2.73 |
| 11 | 12 | 0.23 | Ideal | J | VS1 | 62.8 | 56.0 | 340 | 3.93 | 3.90 | 2.46 |
| 12 | 13 | 0.22 | Premium | F | SI1 | 60.4 | 61.0 | 342 | 3.88 | 3.84 | 2.33 |
| 13 | 14 | 0.31 | Ideal | J | SI2 | 62.2 | 54.0 | 344 | 4.35 | 4.37 | 2.71 |
| 14 | 15 | 0.20 | Premium | E | SI2 | 60.2 | 62.0 | 345 | 3.79 | 3.75 | 2.27 |
# Looking at the last 15 rows
diamonds.tail(15)
| | Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 53925 | 53926 | 0.79 | Ideal | I | SI1 | 61.6 | 56.0 | 2756 | 5.95 | 5.97 | 3.67 |
| 53926 | 53927 | 0.71 | Ideal | E | SI1 | 61.9 | 56.0 | 2756 | 5.71 | 5.73 | 3.54 |
| 53927 | 53928 | 0.79 | Good | F | SI1 | 58.1 | 59.0 | 2756 | 6.06 | 6.13 | 3.54 |
| 53928 | 53929 | 0.79 | Premium | E | SI2 | 61.4 | 58.0 | 2756 | 6.03 | 5.96 | 3.68 |
| 53929 | 53930 | 0.71 | Ideal | G | VS1 | 61.4 | 56.0 | 2756 | 5.76 | 5.73 | 3.53 |
| 53930 | 53931 | 0.71 | Premium | E | SI1 | 60.5 | 55.0 | 2756 | 5.79 | 5.74 | 3.49 |
| 53931 | 53932 | 0.71 | Premium | F | SI1 | 59.8 | 62.0 | 2756 | 5.74 | 5.73 | 3.43 |
| 53932 | 53933 | 0.70 | Very Good | E | VS2 | 60.5 | 59.0 | 2757 | 5.71 | 5.76 | 3.47 |
| 53933 | 53934 | 0.70 | Very Good | E | VS2 | 61.2 | 59.0 | 2757 | 5.69 | 5.72 | 3.49 |
| 53934 | 53935 | 0.72 | Premium | D | SI1 | 62.7 | 59.0 | 2757 | 5.69 | 5.73 | 3.58 |
| 53935 | 53936 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
| 53936 | 53937 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
| 53937 | 53938 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
| 53938 | 53939 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
| 53939 | 53940 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |
diamonds['carat'].min()
0.2
diamonds['carat'].max()
5.01
diamonds['depth'].min()
43.0
diamonds['depth'].max()
79.0
diamonds['table'].min()
43.0
diamonds['table'].max()
95.0
diamonds['price'].min()
326
diamonds['price'].max()
18823
diamonds['x'].min()
0.0
diamonds['x'].max()
10.74
diamonds['y'].min()
0.0
diamonds['y'].max()
58.9
diamonds['z'].min()
0.0
diamonds['z'].max()
31.8
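The sixteen separate `min()`/`max()` cells above can be collapsed into a single `agg` call. A minimal sketch on a hypothetical miniature sample (the notebook itself would call this on the full `diamonds` frame):

```python
import pandas as pd

# Hypothetical miniature sample of the diamonds data
diamonds = pd.DataFrame({
    "carat": [0.23, 0.21, 5.01],
    "depth": [61.5, 43.0, 79.0],
    "price": [326, 18823, 2757],
})

# One call reports the range of every listed column at once
ranges = diamonds[["carat", "depth", "price"]].agg(["min", "max"])
print(ranges)
```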
Carat: the weight of the diamond; ranges from 0.2 to 5.01 in this dataset.
Cut: the quality of the cut, with five grades (Fair, Good, Very Good, Premium, Ideal).
Color: the diamond colour, from J (worst) to D (best).
Clarity: how flawless the diamond is.
Depth: the diamond's measurement from top to bottom; ranges from 43 to 79.
Table: the flat facet of the diamond seen when the stone is face up; ranges from 43 to 95.
Price: price in US dollars; ranges from 326 to 18,823.
x: length in mm; ranges from 0 to 10.74.
y: width in mm; ranges from 0 to 58.9.
z: depth in mm; ranges from 0 to 31.8.
print(diamonds.shape)
(53940, 11)
diamonds.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 53940 entries, 0 to 53939 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 53940 non-null int64 1 carat 53940 non-null float64 2 cut 53940 non-null object 3 color 53940 non-null object 4 clarity 53940 non-null object 5 depth 53940 non-null float64 6 table 53940 non-null float64 7 price 53940 non-null int64 8 x 53940 non-null float64 9 y 53940 non-null float64 10 z 53940 non-null float64 dtypes: float64(6), int64(2), object(3) memory usage: 4.5+ MB
The features fall into two types: continuous and categorical.
Categorical features: cut, color, and clarity.
Continuous features: carat, depth, table, x, y, and z.
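Rather than listing the two groups by hand, `select_dtypes` can derive the split programmatically. A sketch on a hypothetical sample with the same column dtypes as the dataset:

```python
import pandas as pd

# Hypothetical sample mirroring the dataset's column types
diamonds = pd.DataFrame({
    "carat": [0.23, 0.21],
    "cut": ["Ideal", "Premium"],
    "color": ["E", "E"],
    "clarity": ["SI2", "SI1"],
    "price": [326, 326],
})

# object columns are the categorical features, numeric columns the continuous ones
categorical = diamonds.select_dtypes(include="object").columns.tolist()
continuous = diamonds.select_dtypes(include="number").columns.tolist()
print(categorical)  # ['cut', 'color', 'clarity']
print(continuous)   # ['carat', 'price']
```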
cut=diamonds.cut.value_counts()
cut
Ideal 21551 Premium 13791 Very Good 12082 Good 4906 Fair 1610 Name: cut, dtype: int64
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
f, ax = plt.subplots(figsize=(25,10))
ax.pie(cut, labels=cut.keys(), autopct='%1.1f%%')
ax.legend(labels=cut.keys(), loc=2)
font1 = {'family':'serif','color':'black','size':20}
plt.title("Types Of Cut's Percentages", fontdict = font1)
Text(0.5, 1.0, "Types Of Cut's Percentages")
Ideal accounts for the largest share of cuts and Fair for the smallest, so the distribution of cut types is far from uniform.
df = sns.load_dataset("diamonds")
sns.barplot(data=df, x="cut", y="price")
font1 = {'family':'serif','color':'black','size':20}
plt.title("Comparison Between Cut and Price", fontdict = font1)
Text(0.5, 1.0, 'Comparison Between Cut and Price')
It is normal to see the Premium cut with the highest average price, but it is surprising that the Fair cut, the worst cut grade, comes second after Premium, and that Ideal averages less than both Very Good and Good.
color=diamonds.color.value_counts()
color
G 11292 E 9797 F 9542 H 8304 D 6775 I 5422 J 2808 Name: color, dtype: int64
f, ax = plt.subplots(figsize=(25,10))
ax.pie(color, labels=color.keys(), autopct='%1.1f%%')
ax.legend(labels=color.keys(), loc=2)
font1 = {'family':'serif','color':'black','size':20}
plt.title("Types Of color's Percentages", fontdict = font1)
Text(0.5, 1.0, "Types Of color's Percentages")
J, the worst colour, has the smallest share, while D, the best colour, has only an average share. G, a middling colour, has the largest share, and E, the second-best colour, has the second-largest share.
df = sns.load_dataset("diamonds")
sns.barplot(data=df, x="color", y="price")
font1 = {'family':'serif','color':'black','size':20}
plt.title("Comparison Between color and Price", fontdict = font1)
Text(0.5, 1.0, 'Comparison Between color and Price')
J and I are the worst colour grades, yet they have the highest average prices, while D and E, the best grades, have the lowest.
clarity = diamonds.clarity.value_counts()
clarity
SI1 13065 VS2 12258 SI2 9194 VS1 8171 VVS2 5066 VVS1 3655 IF 1790 I1 741 Name: clarity, dtype: int64
f, ax = plt.subplots(figsize=(25,10))
ax.pie(clarity, labels=clarity.keys(), autopct='%1.1f%%')
ax.legend(labels=clarity.keys(), loc=2)
font1 = {'family':'serif','color':'black','size':20}
plt.title("Types Of clarity's Percentages", fontdict = font1)
Text(0.5, 1.0, "Types Of clarity's Percentages")
IF (Internally Flawless) has the second-lowest share at 3.3%, while I1 (Included) has the lowest at 1.4%. VS2 (Very Slightly Included) and SI1 (Slightly Included) have average shares and sit in the middle of the clarity grades.
df = sns.load_dataset("diamonds")
sns.barplot(data=df, x="clarity", y="price")
font1 = {'family':'serif','color':'black','size':20}
plt.title("Comparison Between clarity and Price", fontdict = font1)
Text(0.5, 1.0, 'Comparison Between clarity and Price')
Slightly Included stones have the highest average price, while Internally Flawless, which should command the highest price, has the second lowest.
The data seems to contain outliers within specific categorical levels:
In color, D should command the highest price, but the contrary happened.
In clarity, Internally Flawless should have the highest price, but in this dataset IF has the second-lowest.
In cut, Fair has the second-highest price.
clarity_cut_table = pd.crosstab(index=diamonds["clarity"], columns=diamonds["cut"])
clarity_cut_table.plot(kind="bar",
figsize=(10,10),
stacked=True)
font1 = {'family':'serif','color':'black','size':20}
plt.title("Clarity vs Cut", fontdict = font1)
Text(0.5, 1.0, 'Clarity vs Cut')
You can see that most people prefer to buy diamonds of SI1 clarity, followed by VS2, SI2, and VS1, and the cuts they prefer are Ideal, Premium, and Very Good. People are not taking the highest-clarity diamonds such as IF or VVS1; they are ready to sacrifice clarity and focus instead on the cut of the diamond.
cut_clarity_table = pd.crosstab(index=diamonds["cut"], columns=diamonds["clarity"])
cut_clarity_table.plot(kind="bar",
figsize=(10,10),
stacked=True)
font1 = {'family':'serif','color':'black','size':20}
plt.title("Cut vs Clarity", fontdict = font1)
Text(0.5, 1.0, 'Cut vs Clarity')
People prefer the Ideal cut over any other, followed by Premium and Very Good.
People focus more on cut than on clarity.
color_clarity_table = pd.crosstab(index=diamonds["color"], columns=diamonds["clarity"])
color_clarity_table.plot(kind="bar",
figsize=(8,9),
stacked=True)
font1 = {'family':'serif','color':'black','size':20}
plt.title("Color vs Clarity", fontdict = font1)
Text(0.5, 1.0, 'Color vs Clarity')
People prefer the G colour, followed by E, F, and H.
The clarity grade they mostly prefer is SI1.
From the plots above, carat appears to matter most for predicting the price, followed by cut, color, and clarity.
diamonds['clarity']=diamonds['clarity'].replace(['IF','VVS1','VVS2','VS1','VS2','SI1','SI2','I1'],[8,7,6,5,4,3,2,1])
diamonds['color'] = diamonds['color'].replace(['D','E','F','G','H','I','J'],[7,6,5,4,3,2,1])
diamonds['cut'] = diamonds['cut'].replace(['Ideal','Premium','Very Good','Good','Fair'],[5,4,3,2,1])
diamonds.head()
| | Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.23 | 5 | 6 | 2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 2 | 0.21 | 4 | 6 | 3 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 3 | 0.23 | 2 | 6 | 5 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 4 | 0.29 | 4 | 2 | 4 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 5 | 0.31 | 2 | 1 | 2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
Here I replaced every categorical value with a numeric code, assuming that the best grade in each category takes the highest number.
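An alternative to `replace`-based encoding that makes the ordering explicit is an ordered `pd.Categorical`. The category list and the `+1` offset below are assumptions chosen to reproduce the 1–5 cut scheme:

```python
import pandas as pd

cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]  # worst → best

s = pd.Series(["Ideal", "Fair", "Premium"])
# .codes gives 0-based ranks following cut_order; +1 matches the 1–5 scheme
encoded = pd.Categorical(s, categories=cut_order, ordered=True).codes + 1
print(list(encoded))  # [5, 1, 4]
```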
The first column of the dataset is just a duplicate row index, so I will drop it.
diamonds.drop('Unnamed: 0',inplace=True,axis=1)
diamonds.head()
| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 5 | 6 | 2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | 4 | 6 | 3 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | 2 | 6 | 5 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | 4 | 2 | 4 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | 2 | 1 | 2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
diamonds.describe()
| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 |
| mean | 0.797940 | 3.904097 | 4.405803 | 4.051020 | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
| std | 0.474011 | 1.116600 | 1.701105 | 1.647136 | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
| min | 0.200000 | 1.000000 | 1.000000 | 1.000000 | 43.000000 | 43.000000 | 326.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.400000 | 3.000000 | 3.000000 | 3.000000 | 61.000000 | 56.000000 | 950.000000 | 4.710000 | 4.720000 | 2.910000 |
| 50% | 0.700000 | 4.000000 | 4.000000 | 4.000000 | 61.800000 | 57.000000 | 2401.000000 | 5.700000 | 5.710000 | 3.530000 |
| 75% | 1.040000 | 5.000000 | 6.000000 | 5.000000 | 62.500000 | 59.000000 | 5324.250000 | 6.540000 | 6.540000 | 4.040000 |
| max | 5.010000 | 5.000000 | 7.000000 | 8.000000 | 79.000000 | 95.000000 | 18823.000000 | 10.740000 | 58.900000 | 31.800000 |
Notice the zero values in the x, y, and z columns: those rows describe diamonds with no dimensions, so I will remove them.
# Dropping rows with zero dimensions
diamonds = diamonds.drop(diamonds[diamonds['x'] == 0].index)
diamonds = diamonds.drop(diamonds[diamonds['y'] == 0].index)
diamonds = diamonds.drop(diamonds[diamonds['z'] == 0].index)
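The three drop calls above can also be expressed as a single boolean mask. A sketch on a hypothetical three-row sample:

```python
import pandas as pd

# Hypothetical sample with two rows that have a zero dimension
diamonds = pd.DataFrame({
    "x": [3.95, 0.0, 4.05],
    "y": [3.98, 3.84, 0.0],
    "z": [2.43, 2.31, 2.31],
    "price": [326, 326, 327],
})

# Keep only rows where every dimension is positive (one pass instead of three)
diamonds = diamonds[(diamonds[["x", "y", "z"]] > 0).all(axis=1)]
print(len(diamonds))  # 1
```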
diamonds.describe()
| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 53920.000000 | 53920.000000 | 53920.000000 | 53920.000000 | 53920.000000 | 53920.000000 | 53920.000000 | 53920.000000 | 53920.000000 | 53920.000000 |
| mean | 0.797698 | 3.904228 | 4.405972 | 4.051502 | 61.749514 | 57.456834 | 3930.993231 | 5.731627 | 5.734887 | 3.540046 |
| std | 0.473795 | 1.116579 | 1.701272 | 1.647005 | 1.432331 | 2.234064 | 3987.280446 | 1.119423 | 1.140126 | 0.702530 |
| min | 0.200000 | 1.000000 | 1.000000 | 1.000000 | 43.000000 | 43.000000 | 326.000000 | 3.730000 | 3.680000 | 1.070000 |
| 25% | 0.400000 | 3.000000 | 3.000000 | 3.000000 | 61.000000 | 56.000000 | 949.000000 | 4.710000 | 4.720000 | 2.910000 |
| 50% | 0.700000 | 4.000000 | 4.000000 | 4.000000 | 61.800000 | 57.000000 | 2401.000000 | 5.700000 | 5.710000 | 3.530000 |
| 75% | 1.040000 | 5.000000 | 6.000000 | 5.000000 | 62.500000 | 59.000000 | 5323.250000 | 6.540000 | 6.540000 | 4.040000 |
| max | 5.010000 | 5.000000 | 7.000000 | 8.000000 | 79.000000 | 95.000000 | 18823.000000 | 10.740000 | 58.900000 | 31.800000 |
ax = sns.pairplot(diamonds, hue= "cut", palette = "Spectral")
A closer look at the continuous features against price
sns.set_palette("afmhot")
cols = ['carat','x','y','z','table','depth']
c = 0
fig, axs = plt.subplots(ncols = len(cols), figsize=(20,7))
for i in cols :
sns.scatterplot(data = diamonds,x = diamonds['price'],y = diamonds[i], ax = axs[c])
c+=1
diamonds = diamonds[(diamonds['y'] < 30)]
diamonds = diamonds[(diamonds['z'] < 30) & (diamonds['z'] > 2)]
diamonds = diamonds[(diamonds['table'] < 80) & (diamonds['table'] > 40)]
diamonds = diamonds[(diamonds['depth'] < 75) & (diamonds['depth'] > 45)]
As you can see, carat and x show no outliers against price, while y, z, table, and depth do.
So I removed the outliers indicated by the scatter plots.
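The cut-offs above were read off the scatter plots by eye; Tukey's IQR rule is a common way to derive such bounds automatically. A sketch on a hypothetical handful of z values:

```python
import pandas as pd

s = pd.Series([2.43, 2.31, 2.31, 2.63, 2.75, 31.8])  # hypothetical z values

# Tukey's rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(s[~mask].tolist())  # [31.8] — the implausible z value is flagged
```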
To make sure that the outliers were removed:
diamonds.shape
(53907, 10)
sns.set_palette("afmhot")
cols = ['y','z','table','depth']
c = 0
fig, axs = plt.subplots(ncols = len(cols), figsize=(20,7))
for i in cols :
sns.scatterplot(data = diamonds,x = diamonds['price'],y = diamonds[i], ax = axs[c])
c+=1
#Examining correlation matrix using heatmap
cmap = sns.diverging_palette(205, 133, 63, as_cmap=True)
cols = (["#682F2F", "#9E726F", "#D6B2B1", "#B9C0C9", "#9F8A78", "#F3AB60"])
corrmat= diamonds.corr()
f, ax = plt.subplots(figsize=(15,12))
sns.heatmap(corrmat,cmap=cols,annot=True)
<AxesSubplot:>
It seems there is a lot of correlation, but that makes sense: x · y · z gives the volume, and carat depends on the volume, as the plot shows.
So I will introduce a new column in the dataset: volume = x · y · z.
diamonds["volume"] = diamonds.x * diamonds.y * diamonds.z
diamonds["volume"].head()
0 38.202030 1 34.505856 2 38.076885 3 46.724580 4 51.917250 Name: volume, dtype: float64
diamonds.head()
| | carat | cut | color | clarity | depth | table | price | x | y | z | volume |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 5 | 6 | 2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 38.202030 |
| 1 | 0.21 | 4 | 6 | 3 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | 34.505856 |
| 2 | 0.23 | 2 | 6 | 5 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | 38.076885 |
| 3 | 0.29 | 4 | 2 | 4 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 | 46.724580 |
| 4 | 0.31 | 2 | 1 | 2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 51.917250 |
diamonds = diamonds.drop(columns={"x","y","z"})
diamonds.head()
| | carat | cut | color | clarity | depth | table | price | volume |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 5 | 6 | 2 | 61.5 | 55.0 | 326 | 38.202030 |
| 1 | 0.21 | 4 | 6 | 3 | 59.8 | 61.0 | 326 | 34.505856 |
| 2 | 0.23 | 2 | 6 | 5 | 56.9 | 65.0 | 327 | 38.076885 |
| 3 | 0.29 | 4 | 2 | 4 | 62.4 | 58.0 | 334 | 46.724580 |
| 4 | 0.31 | 2 | 1 | 2 | 63.3 | 58.0 | 335 | 51.917250 |
#Examining correlation matrix using heatmap
cmap = sns.diverging_palette(205, 133, 63, as_cmap=True)
cols = (["#682F2F", "#9E726F", "#D6B2B1", "#B9C0C9", "#9F8A78", "#F3AB60"])
corrmat= diamonds.corr()
f, ax = plt.subplots(figsize=(15,12))
sns.heatmap(corrmat,cmap=cols,annot=True)
<AxesSubplot:>
As the heatmap shows, price is correlated with both volume and carat.
Next, test whether the price distribution is normal:
skY=diamonds['price'].skew()
skY
1.6186409761621152
A skew of 1.618 means the price distribution is not normal; a skew near zero would indicate an approximately normal distribution.
from sklearn.preprocessing import MinMaxScaler,StandardScaler,PolynomialFeatures
from sklearn import preprocessing
plt.figure(figsize=[15,5])
plt.subplot(1,2,1)
plt.hist(diamonds['price'], bins=50, ec='black', color='#2196f3')
plt.xlabel('Price in US dollars')
plt.ylabel('Number of Diamonds')
plt.title(f'Before Log transformation, Skew:{round(skY,3)}')
plt.subplot(1,2,2)
Y = np.log(diamonds['price'])
sk = Y.skew()  # skew of the log-transformed price
plt.hist(Y, bins=50, ec='black', color='#2196f3')
plt.xlabel('Price in logs')
plt.ylabel('Number of Diamonds')
plt.title(f'After Log transformation, Skew:{round(sk,3)}')
plt.show()
After the log transform, the price distribution is much closer to normal. (Note that the models below are still trained on the raw price, not its logarithm.)
X_notdum = diamonds
figure = plt.figure(figsize=(15,10))
for n, col in enumerate(X_notdum.columns):
ax = figure.add_subplot(3,4,n+1)
ax.set_title(col)
X_notdum[col].hist(ax=ax, bins=50)
figure.tight_layout()  # tight_layout spaces the subplots so they do not overlap
plt.show()
x=diamonds.drop(["price"],axis =1)
From the histograms above, the features in x appear to need scaling.
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
Xss = sc_X.fit_transform(x)
train_SS = pd.DataFrame(Xss, columns=['carat', 'cut', 'color', 'clarity', 'depth', 'table',
'volume'])
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 5))
ax1.set_title('Before Scaling')
for e in x.columns:
sns.kdeplot(x[e], ax=ax1)
ax2.set_title('After Standard Scaling')
for e in train_SS.columns:
sns.kdeplot(train_SS[e], ax=ax2, legend=None)
plt.show()
from sklearn.preprocessing import MinMaxScaler
mmc_X = MinMaxScaler()
Xmm = mmc_X.fit_transform(x)
train_MM = pd.DataFrame(Xmm, columns=['carat', 'cut', 'color', 'clarity', 'depth', 'table',
'volume'])
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 5))
ax1.set_title('Before Scaling')
for e in x.columns:
sns.kdeplot(x[e], ax=ax1)
ax2.set_title('After Min-Max Scaling')
for e in train_MM.columns:
sns.kdeplot(train_MM[e], ax=ax2, legend=None)
plt.show()
# Defining the independent and dependent variables
x=diamonds.drop(["price"],axis =1)
y= diamonds["price"]
x
| | carat | cut | color | clarity | depth | table | volume |
|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 5 | 6 | 2 | 61.5 | 55.0 | 38.202030 |
| 1 | 0.21 | 4 | 6 | 3 | 59.8 | 61.0 | 34.505856 |
| 2 | 0.23 | 2 | 6 | 5 | 56.9 | 65.0 | 38.076885 |
| 3 | 0.29 | 4 | 2 | 4 | 62.4 | 58.0 | 46.724580 |
| 4 | 0.31 | 2 | 1 | 2 | 63.3 | 58.0 | 51.917250 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 53935 | 0.72 | 5 | 7 | 3 | 60.8 | 57.0 | 115.920000 |
| 53936 | 0.72 | 2 | 7 | 3 | 63.1 | 55.0 | 118.110175 |
| 53937 | 0.70 | 3 | 7 | 3 | 62.8 | 60.0 | 114.449728 |
| 53938 | 0.86 | 4 | 3 | 2 | 61.0 | 58.0 | 140.766120 |
| 53939 | 0.75 | 5 | 7 | 2 | 62.2 | 55.0 | 124.568444 |
53907 rows × 7 columns
y
0 326
1 326
2 327
3 334
4 335
...
53935 2757
53936 2757
53937 2757
53938 2757
53939 2757
Name: price, Length: 53907, dtype: int64
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x, y,test_size=0.20, random_state=25)
#libraries
import pandas as pd
from pandas import DataFrame,Series
import matplotlib.pyplot as plt
import numpy as np
import warnings
import seaborn as sns
sns.set(context="notebook", palette="Spectral", style = 'darkgrid' ,font_scale = 1.5, color_codes=True)
from sklearn.preprocessing import MinMaxScaler,StandardScaler,PolynomialFeatures
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression,Ridge,Lasso, ElasticNet,SGDRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV , KFold , cross_val_score
from sklearn.metrics import mean_squared_log_error,mean_squared_error, r2_score,mean_absolute_error
from sklearn.pipeline import Pipeline
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import os
#Show the results of the split
print("Training set has {} samples.".format(x_train.shape[0]))
print("Testing set has {} samples.".format(x_val.shape[0]))
Training set has 43125 samples. Testing set has 10782 samples.
A linear regression model fits a sloped straight line representing the relationship between the variables. Its goal is to find the best-fit line, i.e. the line with the least error between the predicted and the actual values.
dlin = LinearRegression()
dlin.fit(x_train, y_train)
dlin_pred = dlin.predict(x_val)
print('####### Linear Regression #######')
print('Score : %.4f' % dlin.score(x_val, y_val))
dlin_r2 = dlin.score(x_val, y_val)
dlin_mse = mean_squared_error(y_val, dlin_pred, squared=False)  # squared=False returns the RMSE
print('')
print('RMSE : %0.2f ' % dlin_mse)
print('R2 : %0.2f ' % dlin_r2)
n=x_val.shape[0]
p=x_val.shape[1]
adj_rsquared = 1 - (1 - dlin_r2) * ((n - 1)/(n-p-1))
print('Adjusted R Squared: {}'.format(adj_rsquared))
####### Linear Regression ####### Score : 0.9072 RMSE : 1219.40 R2 : 0.91 Adjusted R Squared: 0.9071803954289116
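The adjusted R² computation, 1 − (1 − R²)(n − 1)/(n − p − 1), is repeated verbatim for every model below; a small helper (a hypothetical name, not from the notebook) would avoid the duplication:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared: penalizes R² for the number of predictors p given n samples."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# e.g. the linear model: R² = 0.9072 on 10782 validation rows with 7 features
print(adjusted_r2(0.9072, 10782, 7))
```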
plt.figure(figsize=(7,7))
sns.regplot(x=y_val, y=dlin_pred, fit_reg=True)
<AxesSubplot:xlabel='price'>
Decision Tree
A decision tree, as its name suggests, asks a series of questions that narrow down the information until it reaches the answer you want. It consists of nodes: the main node, called the root node, holds most of the information and is broken down into smaller nodes that carry less of it.
from sklearn.tree import DecisionTreeRegressor
dtm = DecisionTreeRegressor(min_samples_split=40, max_features="auto")
dtm.fit(x_train, y_train)
dtm_pred = dtm.predict(x_val)
dtm_r2 = dtm.score(x_val, y_val)
dtm_mse = mean_squared_error(y_val, dtm_pred, squared=False)  # squared=False returns the RMSE
print('####### DecisionTree Regressor #######')
print('Score : %.4f' % dtm.score(x_val, y_val))
print('')
print('RMSE : %0.2f ' % dtm_mse)
print('R2 : %0.2f ' % dtm_r2)
n=x_val.shape[0]
p=x_val.shape[1]
adj_rsquared = 1 - (1 - dtm_r2) * ((n - 1)/(n-p-1))
print('Adjusted R Squared: {}'.format(adj_rsquared))
####### DecisionTree Regressor ####### Score : 0.9758 RMSE : 622.66 R2 : 0.98 Adjusted R Squared: 0.9757978613237582
plt.figure(figsize=(7,7))
sns.regplot(x=y_val, y=dtm_pred, fit_reg=True)
<AxesSubplot:xlabel='price'>
The Random Forest Regressor
A random forest builds a set of decision trees, each from a randomly selected subset of the training set, and then aggregates the predictions of the individual trees to decide the final prediction.
from sklearn.ensemble import RandomForestRegressor
rfm = RandomForestRegressor(n_estimators=500 ,min_samples_split=40, max_features="auto", min_samples_leaf=1, bootstrap=True)
rfm.fit(x_train, y_train)
rfm_pred = rfm.predict(x_val)
rfm_r2 = rfm.score(x_val, y_val)
rfm_mse = mean_squared_error(y_val, rfm_pred, squared=False)  # squared=False returns the RMSE
print('####### RandomForest Regressor #######')
print('Score : %.4f' % rfm.score(x_val, y_val))
print('')
print('RMSE : %0.2f ' % rfm_mse)
print('R2 : %0.2f ' % rfm_r2)
n=x_val.shape[0]
p=x_val.shape[1]
adj_rsquared = 1 - (1 - rfm_r2) * ((n - 1)/(n-p-1))
print('Adjusted R Squared: {}'.format(adj_rsquared))
####### RandomForest Regressor ####### Score : 0.9798 RMSE : 568.63 R2 : 0.98 Adjusted R Squared: 0.979815779756908
plt.figure(figsize=(7,7))
sns.regplot(x=y_val, y=rfm_pred, fit_reg=True)
<AxesSubplot:xlabel='price'>
Support Vector Regressor
A support vector machine fits a best-fit hyperplane to the data you provide; the classification variant uses that hyperplane to divide or categorize the data, while the regression variant (SVR), used here, fits the hyperplane to predict a continuous target.
from sklearn.svm import SVR
svrm = SVR(C=1000)
svrm.fit(x_train, y_train)
svrm_pred = svrm.predict(x_val)
svrm_r2 = svrm.score(x_val, y_val)
svrm_mse = mean_squared_error(y_val, svrm_pred, squared=False)  # squared=False returns the RMSE
print('####### Support Vector Regressor #######')
print('Score : %.4f' % svrm.score(x_val, y_val))
print('')
print('RMSE : %0.2f ' % svrm_mse)
print('R2 : %0.2f ' % svrm_r2)
n=x_val.shape[0]
p=x_val.shape[1]
adj_rsquared = 1 - (1 - svrm_r2) * ((n - 1)/(n-p-1))
print('Adjusted R Squared: {}'.format(adj_rsquared))
####### Support Vector Regressor ####### Score : 0.9401 RMSE : 979.67 R2 : 0.94 Adjusted R Squared: 0.9400891173589735
plt.figure(figsize=(7,7))
sns.regplot(x=y_val, y=svrm_pred, fit_reg=True)
<AxesSubplot:xlabel='price'>
Bagging Regressor
Bagging regressors are similar to bagging classifiers: each base regressor is trained on a random subset of the original training set and the predictions are aggregated. Because the target variable is numeric, the aggregation averages the individual predictions.
from sklearn.ensemble import BaggingRegressor
bgm = BaggingRegressor(n_estimators=500, max_samples=30000, bootstrap=True, bootstrap_features=False)
bgm.fit(x_train, y_train)
bgm_pred = bgm.predict(x_val)
bgm_r2 = bgm.score(x_val, y_val)
bgm_mse = mean_squared_error(y_val, bgm_pred, squared=False)  # squared=False returns the RMSE
print('####### Bagging Regressor #######')
print('Score : %.4f' % bgm.score(x_val, y_val))
print('')
print('RMSE : %0.2f ' % bgm_mse)
print('R2 : %0.2f ' % bgm_r2)
n=x_val.shape[0]
p=x_val.shape[1]
adj_rsquared = 1 - (1 - bgm_r2) * ((n - 1)/(n-p-1))
print('Adjusted R Squared: {}'.format(adj_rsquared))
####### Bagging Regressor ####### Score : 0.9815 RMSE : 544.82 R2 : 0.98 Adjusted R Squared: 0.9814708430625516
plt.figure(figsize=(7,7))
sns.regplot(x=y_val, y=bgm_pred, fit_reg=True)
<AxesSubplot:xlabel='price'>
MultiLayer Perceptron Regressor
MLPRegressor is an artificial neural network model that uses backpropagation to adjust the weights between neurons in order to improve prediction accuracy. It implements a multi-layer perceptron trained with backpropagation and stochastic gradient descent.
from sklearn.neural_network import MLPRegressor
mlpreg = MLPRegressor(hidden_layer_sizes=(300, ), activation='relu', solver='adam', alpha=1000, batch_size='auto', max_iter=30000, shuffle=False, random_state=None)
mlpreg.fit(x_train, y_train)
mlp_pred = mlpreg.predict(x_val)
mlp_r2 = mlpreg.score(x_val, y_val)
mlp_mse = mean_squared_error(y_val, mlp_pred, squared=False)  # squared=False returns the RMSE
print("r2:", mlp_r2, "rmse:", mlp_mse)
print('####### MLPRegressor #######')
print('Score : %.4f' % mlpreg.score(x_val, y_val))
print('')
print('RMSE : %0.2f ' % mlp_mse)
print('R2 : %0.2f ' % mlp_r2)
n=x_val.shape[0]
p=x_val.shape[1]
adj_rsquared = 1 - (1 - mlp_r2) * ((n - 1)/(n-p-1))
print('Adjusted R Squared: {}'.format(adj_rsquared))
r2: 0.9753447053436244 rmse: 628.6673734627636 ####### MLPRegressor ####### Score : 0.9753 RMSE : 628.67 R2 : 0.98 Adjusted R Squared: 0.9753286864961588
pd.DataFrame({
    'R-Squared': [dlin_r2, dtm_r2, rfm_r2, svrm_r2, bgm_r2, mlp_r2],
    'RMSE': [dlin_mse, dtm_mse, rfm_mse, svrm_mse, bgm_mse, mlp_mse],
    },
    index=['Linear Regression', 'Decision Tree Regressor', 'Random Forest Regressor', 'Support Vector Regressor', 'Bagging Regressor', 'MultiLayer Perceptron Regressor'])
| | R-Squared | RMSE |
|---|---|---|
| Linear Regression | 0.907241 | 1219.395654 |
| Decision Tree Regressor | 0.975814 | 622.66 |
| Random Forest Regressor | 0.979829 | 568.631317 |
| Support Vector Regressor | 0.940128 | 979.665118 |
| Bagging Regressor | 0.981483 | 544.819465 |
| MultiLayer Perceptron Regressor | 0.975345 | 628.667373 |
The Bagging Regressor is the best model: it has the highest R² (0.98) of all the models, and its root mean squared error is the smallest, as the table above shows.
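A single 80/20 split can flatter one model by chance; k-fold cross-validation would make the comparison sturdier. A sketch on synthetic data (the notebook's x and y are not reproduced here), with a smaller ensemble for speed:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))
# Synthetic price-like target dominated by the first feature, plus noise
y = 1000 * X[:, 0] + 50 * X[:, 1] + rng.normal(scale=5, size=300)

# Mean R² over 5 folds is a steadier estimate than one hold-out split
scores = cross_val_score(BaggingRegressor(n_estimators=50, random_state=0),
                         X, y, cv=5, scoring="r2")
print(scores.mean())
```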
# Importing the test data and analysing it
testData = pd.read_csv('diamonds_test.csv')
testData.head(15)
| | Unnamed: 0 | carat | cut | color | clarity | depth | table | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.30 | Ideal | H | SI2 | 60.0 | 56.0 | 4.41 | 4.43 | 2.65 |
| 1 | 1 | 0.34 | Ideal | D | IF | 62.1 | 57.0 | 4.52 | 4.46 | 2.79 |
| 2 | 2 | 1.57 | Very Good | I | VS2 | 60.3 | 58.0 | 7.58 | 7.55 | 4.56 |
| 3 | 3 | 0.31 | Ideal | H | VS2 | 61.8 | 57.0 | 4.32 | 4.36 | 2.68 |
| 4 | 4 | 1.51 | Good | I | VVS1 | 64.0 | 60.0 | 7.26 | 7.21 | 4.63 |
| 5 | 5 | 0.70 | Very Good | E | SI1 | 59.6 | 63.0 | 5.72 | 5.65 | 3.39 |
| 6 | 6 | 0.51 | Premium | F | SI2 | 58.3 | 61.0 | 5.18 | 5.14 | 3.01 |
| 7 | 7 | 1.55 | Very Good | I | VS1 | 59.0 | 58.0 | 7.56 | 7.63 | 4.48 |
| 8 | 8 | 0.41 | Ideal | D | SI1 | 62.2 | 57.0 | 4.76 | 4.70 | 2.94 |
| 9 | 9 | 0.30 | Very Good | H | VS2 | 62.5 | 58.0 | 4.26 | 4.28 | 2.67 |
| 10 | 10 | 1.23 | Very Good | G | VVS1 | 61.3 | 57.0 | 6.88 | 6.96 | 4.24 |
| 11 | 11 | 2.54 | Very Good | H | SI2 | 63.5 | 56.0 | 8.68 | 8.65 | 5.50 |
| 12 | 12 | 0.90 | Premium | E | SI1 | 59.8 | 58.0 | 6.26 | 6.21 | 3.73 |
| 13 | 13 | 0.90 | Good | E | SI1 | 62.2 | 65.0 | 6.13 | 6.08 | 3.80 |
| 14 | 14 | 0.76 | Very Good | F | VS2 | 62.0 | 58.0 | 5.80 | 5.86 | 3.62 |
# Dropping rows with zero dimensions
testData = testData.drop(testData[testData['x'] == 0].index)
testData = testData.drop(testData[testData['y'] == 0].index)
testData = testData.drop(testData[testData['z'] == 0].index)
testData["volume"] = testData.x * testData.y * testData.z
testData.head()
| | Unnamed: 0 | carat | cut | color | clarity | depth | table | x | y | z | volume |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.30 | Ideal | H | SI2 | 60.0 | 56.0 | 4.41 | 4.43 | 2.65 | 51.771195 |
| 1 | 1 | 0.34 | Ideal | D | IF | 62.1 | 57.0 | 4.52 | 4.46 | 2.79 | 56.244168 |
| 2 | 2 | 1.57 | Very Good | I | VS2 | 60.3 | 58.0 | 7.58 | 7.55 | 4.56 | 260.964240 |
| 3 | 3 | 0.31 | Ideal | H | VS2 | 61.8 | 57.0 | 4.32 | 4.36 | 2.68 | 50.478336 |
| 4 | 4 | 1.51 | Good | I | VVS1 | 64.0 | 60.0 | 7.26 | 7.21 | 4.63 | 242.355498 |
testData = testData.drop(columns={"x","y","z"})
testData.drop('Unnamed: 0',inplace=True,axis=1)
def test(testData):
    # Data transformation (same ordinal encoding as the training data)
    testData['clarity'] = testData['clarity'].replace(['IF','VVS1','VVS2','VS1','VS2','SI1','SI2','I1'],[8,7,6,5,4,3,2,1])
    testData['color'] = testData['color'].replace(['D','E','F','G','H','I','J'],[7,6,5,4,3,2,1])
    testData['cut'] = testData['cut'].replace(['Ideal','Premium','Very Good','Good','Fair'],[5,4,3,2,1])
    X = DataFrame(testData, columns=['carat','cut','color','clarity','depth','table','volume'])
    # The model was trained on the raw price, so the prediction is already in
    # dollars; exponentiating it would overflow. Use +/- RMSE as a rough band.
    Y = bgm.predict(X)
    upper_bound = Y + bgm_mse
    lower_bound = Y - bgm_mse
    print(f'The price predicted by our model is {round(Y[0], 2)}')
    print(f'The price predicted by our model is in a range between {round(lower_bound[0], 2)} '
          f'and {round(upper_bound[0], 2)}')
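The ±RMSE band used here is a rough symmetric guide; empirical quantiles of the validation residuals would give an interval calibrated to the data. A sketch with synthetic stand-ins for the validation prices and the model's predictions:

```python
import numpy as np

rng = np.random.default_rng(1)
y_true = rng.uniform(326, 18823, size=1000)        # stand-in for validation prices
preds = y_true + rng.normal(scale=550, size=1000)  # stand-in for model predictions

# Central 90% band of the residual distribution
residuals = y_true - preds
lo, hi = np.quantile(residuals, [0.05, 0.95])

point = 2757.0  # some new point prediction
print(f"90% interval: [{point + lo:.0f}, {point + hi:.0f}]")
```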
testData1=pd.DataFrame({'carat': 0.7,
'clarity': 'SI1' ,
'cut': 'Very Good',
'color': 'E',
'table': 63,
'depth' : 59.6,
'volume' : 109.55
}, index =[0])
test(testData1)